From Raw Features to Effective Embeddings: A Three-Stage Approach for Multimodal Recipe Recommendation
Shin, Jeeho, Kim, Kyungho, Shin, Kijung
Recipe recommendation has become an essential task in web-based food platforms. A central challenge is effectively leveraging rich multimodal features beyond user-recipe interactions. Our analysis shows that even simple uses of multimodal signals yield competitive performance, suggesting that systematic enhancement of these signals is highly promising. We propose TESMR, a three-stage framework for recipe recommendation that progressively refines raw multimodal features into effective embeddings through: (1) content-based enhancement using foundation models with multimodal comprehension, (2) relation-based enhancement via message propagation over user-recipe interactions, and (3) learning-based enhancement through contrastive learning with learnable embeddings. Experiments on two real-world datasets show that TESMR outperforms existing methods, achieving 7-15% higher Recall@10.
- Asia > South Korea > Daejeon > Daejeon (0.40)
- Asia > South Korea > Seoul > Seoul (0.05)
- Research Report (0.64)
- Overview (0.47)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
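A minimal sketch of the idea behind stage (2), relation-based enhancement: one round of mean-pooling message propagation over user-recipe interactions. All function names, embeddings, and the choice of mean aggregation here are illustrative assumptions, not taken from the paper.

```python
def propagate(user_emb, recipe_emb, interactions):
    """One propagation step over a bipartite user-recipe graph:
    each user averages the embeddings of recipes they interacted with,
    and each recipe averages the embeddings of its users."""
    def mean(vectors, fallback):
        if not vectors:
            return fallback  # isolated node keeps its current embedding
        dim = len(fallback)
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    new_user = {
        u: mean([recipe_emb[r] for r in recipes], user_emb[u])
        for u, recipes in interactions.items()
    }
    # Invert the interaction map to propagate back to recipes.
    by_recipe = {}
    for u, recipes in interactions.items():
        for r in recipes:
            by_recipe.setdefault(r, []).append(user_emb[u])
    new_recipe = {
        r: mean(by_recipe.get(r, []), emb) for r, emb in recipe_emb.items()
    }
    return new_user, new_recipe

users = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
recipes = {"r1": [0.5, 0.5], "r2": [1.0, 1.0]}
edges = {"u1": ["r1"], "u2": ["r1", "r2"]}
new_u, new_r = propagate(users, recipes, edges)
```

Stacking several such steps would mix in higher-order neighborhood signal before the contrastive stage (3) refines the result.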
M2R2: MultiModal Robotic Representation for Temporal Action Segmentation
Sliwowski, Daniel, Lee, Dongheui
Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer vision. In robotics, algorithms have primarily focused on leveraging proprioceptive information to determine skill boundaries, with recent approaches in surgical robotics incorporating vision. In contrast, computer vision typically relies on exteroceptive sensors, such as cameras. Existing multimodal TAS models in robotics integrate feature fusion within the model, making it difficult to reuse learned features across different models. Meanwhile, pretrained vision-only feature extractors commonly used in computer vision struggle in scenarios with limited object visibility. In this work, we address these challenges by proposing M2R2, a multimodal feature extractor tailored for TAS, which combines information from both proprioceptive and exteroceptive sensors. We introduce a novel pretraining strategy that enables the reuse of learned features across multiple TAS models. Our method achieves state-of-the-art performance on the REASSEMBLE dataset, a challenging multimodal robotic assembly dataset, outperforming existing robotic action segmentation models by 46.6%. Additionally, we conduct an extensive ablation study to evaluate the contribution of different modalities in robotic TAS tasks.
- Europe > Austria > Vienna (0.14)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- North America > United States (0.04)
- Europe > Germany (0.04)
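A toy sketch of the kind of multimodal feature extraction M2R2 targets: per-timestep features from a proprioceptive stream and an exteroceptive (vision) stream are combined into one representation a downstream TAS model can consume. Plain concatenation stands in for the paper's learned fusion, so treat this purely as an illustration of the data flow.

```python
def fuse(proprio, vision):
    """Concatenate time-aligned per-timestep feature sequences
    from two sensor modalities into a single fused sequence."""
    assert len(proprio) == len(vision), "streams must be time-aligned"
    return [p + v for p, v in zip(proprio, vision)]

# Two timesteps: 2-dim proprioceptive features, 1-dim visual features.
fused = fuse([[0.1, 0.2], [0.3, 0.4]], [[1.0], [2.0]])
```

Because the fused features are produced by a standalone extractor rather than inside a TAS model, they can be cached once and reused across different segmentation models, which is the reuse property the abstract emphasizes.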
Modality-Collaborative Low-Rank Decomposers for Few-Shot Video Domain Adaptation
Wanyan, Yuyang, Yang, Xiaoshan, Dong, Weiming, Xu, Changsheng
Abstract--In this paper, we study the challenging task of Few-Shot Video Domain Adaptation (FSVDA). The multimodal nature of videos introduces unique challenges, necessitating the simultaneous consideration of both domain alignment and modality collaboration in a few-shot scenario, which is ignored in previous literature. We observe that, under the influence of domain shift, the generalization performance on the target domain of each individual modality, as well as that of fused multimodal features, is constrained. This is because each modality comprises coupled features with multiple components that exhibit different domain shifts. This variability increases the complexity of domain adaptation, thereby reducing the effectiveness of multimodal feature integration. To address these challenges, we introduce a novel framework of Modality-Collaborative Low-Rank Decomposers (MC-LRD) to decompose, from each modality, modality-unique and modality-shared features with different domain-shift levels that are more amenable to domain alignment. The MC-LRD comprises multiple decomposers for each modality and Multimodal Decomposition Routers (MDR). Each decomposer has progressively shared parameters across different modalities. The MDR is leveraged to selectively activate the decomposers to produce modality-unique and modality-shared features. To ensure efficient decomposition, we apply orthogonal decorrelation constraints separately to decomposers and sub-routers, enhancing their diversity. Furthermore, we propose a cross-domain activation consistency loss to guarantee that target and source samples of the same category exhibit consistent activation preferences over the decomposers, thereby facilitating domain alignment. Extensive experimental results on three public benchmarks demonstrate that our model achieves significant improvements over existing methods.
- Europe > Switzerland > Basel-City > Basel (0.05)
- Asia > China > Beijing > Beijing (0.04)
- Asia > Middle East > Jordan (0.04)
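A small sketch of an orthogonal decorrelation penalty of the kind MC-LRD applies to its decomposers: push the Gram matrix of the weight rows toward the identity so different decomposers stay diverse. Pure Python on list-of-lists matrices; the exact constraint in the paper may differ, so this is only the standard textbook form.

```python
def ortho_penalty(W):
    """Squared Frobenius norm of (W @ W.T - I): zero iff the rows of W
    are orthonormal, growing as rows become correlated."""
    n = len(W)
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(W[i][k] * W[j][k] for k in range(len(W[0])))
            target = 1.0 if i == j else 0.0
            total += (dot - target) ** 2
    return total

orthonormal_rows = [[1.0, 0.0], [0.0, 1.0]]   # no penalty
correlated_rows = [[1.0, 0.0], [1.0, 0.0]]    # identical rows, penalized
```

Adding such a term to the training loss (here for decomposers, and separately for sub-routers) is what encourages the decomposed components to capture distinct factors.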
Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM
Hori, Chiori, Masuyama, Yoshiki, Jain, Siddarth, Corcodel, Radu, Jha, Devesh, Romeres, Diego, Roux, Jonathan Le
Abstract--Human-robot collaboration towards a shared goal requires robots to understand human action and interaction with the surrounding environment. This paper focuses on human-robot interaction (HRI) based on human-robot dialogue that relies on robot action confirmation and action step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmation from a single clip showing a task composed of multiple micro steps. Although actions towards a long-horizon task depend on each other throughout an entire video, current approaches mainly focus on clip-level processing and do not leverage long-context information. This paper proposes a long-context Q-Former incorporating left and right context dependency in full videos. Furthermore, this paper proposes a text-conditioning approach that feeds text embeddings directly into the LLM decoder to mitigate the Q-Former's over-abstraction of textual information. Experiments with the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in the performance of action planning. Furthermore, we demonstrate that the long-context Q-Former improves confirmation and action planning by integrating VideoLLaMA3.
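The left/right context idea can be pictured as a simple windowing step: for each clip in a full video, gather features from its neighbors so the generator can condition on them rather than on the clip alone. The window sizes and names below are illustrative assumptions, not the paper's actual mechanism.

```python
def context_window(clip_feats, idx, left=2, right=2):
    """For clip `idx`, return (left-context clips, current clip,
    right-context clips), clamped at the video boundaries."""
    lo = max(0, idx - left)
    hi = min(len(clip_feats), idx + right + 1)
    return clip_feats[lo:idx], clip_feats[idx], clip_feats[idx + 1:hi]

# Five clip-level feature tokens from one video.
feats = ["c0", "c1", "c2", "c3", "c4"]
lctx, cur, rctx = context_window(feats, 2, left=2, right=1)
```

In the actual model a Q-Former would attend over these neighboring clip features; the point of the sketch is only that each clip's representation is built from the full-video neighborhood, not the clip in isolation.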
UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models
In this work, we focus on a new and practical task of data pre-selection for data-efficient visual object recognition (Fig.1-a). The goal of data pre-selection is to select instances for labeling from an unlabeled dataset through a single pass to maximize model performance for unknown downstream vision tasks (e.g., no knowledge about
- North America > Canada (0.04)
- Asia > Japan (0.04)
- Asia > China > Zhejiang Province > Hangzhou (0.04)
- Africa > Senegal > Kolda Region > Kolda (0.04)
PCR-CA: Parallel Codebook Representations with Contrastive Alignment for Multiple-Category App Recommendation
Tan, Bin, Ge, Wangyao, Wang, Yidi, Liu, Xin, Burtoft, Jeff, Fan, Hao, Wang, Hui
Modern app store recommender systems struggle with multiple-category apps, as traditional taxonomies fail to capture overlapping semantics, leading to suboptimal personalization. We propose PCR-CA (Parallel Codebook Representations with Contrastive Alignment), an end-to-end framework for improved CTR prediction. PCR-CA first extracts compact multimodal embeddings from app text, then introduces a Parallel Codebook VQ-AE module that learns discrete semantic representations across multiple codebooks in parallel -- unlike hierarchical residual quantization (RQ-VAE). This design enables independent encoding of diverse aspects (e.g., gameplay, art style), better modeling multiple-category semantics. To bridge semantic and collaborative signals, we employ a contrastive alignment loss at both the user and item levels, enhancing representation learning for long-tail items. Additionally, a dual-attention fusion mechanism combines ID-based and semantic features to capture user interests, especially for long-tail apps. Experiments on a large-scale dataset show PCR-CA achieves a +0.76% AUC improvement over strong baselines, with +2.15% AUC gains for long-tail apps. Online A/B testing further validates our approach, showing a +10.52% lift in CTR and a +16.30% improvement in CVR, demonstrating PCR-CA's effectiveness in real-world deployment. The new framework has now been fully deployed on the Microsoft Store.
- Asia > Middle East > UAE > Dubai Emirate > Dubai (0.05)
- Asia > China (0.05)
- North America > United States (0.04)
- Information Technology (0.46)
- Leisure & Entertainment (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.66)
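The parallel-versus-residual distinction PCR-CA draws can be illustrated with a toy quantizer: each codebook independently quantizes the same embedding (in residual quantization, by contrast, codebooks apply sequentially to residuals). Codebook contents, names, and the aspect labels below are invented for illustration.

```python
def nearest(vec, codebook):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist(vec, codebook[i]))

def parallel_quantize(vec, codebooks):
    """Quantize vec against every codebook independently,
    yielding one discrete code per codebook."""
    return [nearest(vec, cb) for cb in codebooks]

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # hypothetical "gameplay" aspect
    [[0.0, 1.0], [1.0, 0.0]],  # hypothetical "art style" aspect
]
codes = parallel_quantize([0.9, 0.8], codebooks)
```

Because each codebook sees the full embedding rather than a residual, the discrete codes can encode independent aspects of an app, which is what lets a multi-category app carry several semantic codes at once.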